Facial expression plays a significant role in affective computing and is one of the non-verbal
channels of communication in human-computer interaction. Automatic recognition of human affect has
become an increasingly challenging and interesting problem in recent years. Facial expressions are
significant features for recognizing human emotion in daily life. A facial expression recognition
system (FERS) can be developed for applications such as human affect analysis, health care
assessment, distance learning, driver fatigue detection and human-computer interaction. Basically,
there are three main components in recognizing human facial expression: detection of the face or its
components, feature extraction from the face image, and classification of the expression. This study
presents the methods of feature extraction and classification for FER.
1. INTRODUCTION 

In the era of artificial intelligence, facial expression recognition (FER) is an interesting and challenging
task, complicated by limited datasets, varying environments, pose, occlusion, person-to-person variation, etc.
FER systems have been applied in many areas such as human-computer interaction (HCI), games, data-driven
animation, surveillance, clinical monitoring, etc. [1]. Ekman and Friesen, psychologists from America, defined
six universal facial expressions (fear, happiness, anger, disgust, surprise, and sadness) and also developed the
facial action coding system (FACS), based on action units, to describe the facial features of expressions [2].
Facial expressions convey nonverbal communication cues that play a significant role in interpersonal relations.
Some works in the literature add other emotions such as neutral and contempt, as well as many compound facial
emotions. Some researchers rely on handcrafted features extracted with conventional algorithms, while others
rely on complex features extracted with deep learning methods. In this paper, we explore the feature extraction
methods, feature descriptors, classification methods, feature dimension reduction methods and frameworks of
facial expression recognition systems, and compare their results. The remainder of the paper is organized as
follows. Section 2 reviews the literature on current FER systems. A typical FER system is shown in Section 3.
After that, the two types of features of facial images are discussed in Section 4, and Section 5 describes
facial databases for FER systems. Section 6 describes the problem statement of FER systems. The last section
presents the conclusion and future work.
2. LITERATURE OF CURRENT FER SYSTEM 

The study in [3] used geometric feature extraction, regional local binary pattern (LBP) feature extraction,
fusion of both feature types using autoencoders, and a self-organizing map (SOM)-based classifier. The average
accuracy was 97.55% on the MMI database and 98.95% on the CK+ database. The SOM-based classifier improved
significantly over SVM, with increases of 3.94% on CK+ and 4.36% on MMI [3]. The work in [4] explored multiple
feature fusion using histograms of oriented gradients from three orthogonal planes (HOG-TOP), with experiments
on three datasets: CK+, GEMEP-FERA 2011, and acted facial expression in the wild (AFEW) 4.0 [4]. The authors
of [5] presented a FER model that uses Haar cascades to detect face components and a neural network (NN)
trained on eye and mouth features from the JAFFE Japanese database. Compared with Sobel edge detection methods,
the proposed method achieved better accuracy. The problems of illumination and image pose remain open, as does
the integration of other biometric authentication methods and HCI perception methods to fully meet theoretical
and practical requirements [5]. The study in [6] examined an emotion recognition system using hybrid feature
descriptors that combine spatial bag of features and spatial scale-invariant feature transform (SBoF-SSIFT)
with a K-nearest neighbor classifier. Codebook construction is applied after feature extraction to represent
large feature sets by grouping similar features into a specified number of clusters. The experiments showed
accuracies of 98.33% and 98.5% on the JAFFE and extended Cohn-Kanade (CK+) datasets respectively. However, the
recognition performance depends on the number of clusters for codebook generation, the number of detected
features, the number of levels for image segmentation, and the size of the training dataset [6]. The work in
[7] implemented FER based on cognition and mapped binary patterns, using the basic emotion model and the
circumplex model on CK+ with 100 images for training and 50 images for testing. In the preprocessing step,
unwanted information such as hair, ears, and background is removed from the facial image. LBP and a pseudo 3D
model are used to extract the facial contours and to segment the face area into sub-regions. A mapped local
binary pattern is employed to reduce the feature dimension, and two classifiers, SVM and softmax, are then
used. The results show that local features and expressions are correlated. Moreover, the two classifiers differ
only slightly in performance. Occlusion, complex conditions, and micro-expression recognition are left for
future FER systems [7]. The work in [8] proposed the angled local directional pattern (ALDP) method for texture
analysis of facial expressions, with six classifiers (k-NN, SVM, DT, RF, Gaussian NB and perceptron) on the CK+
dataset. First, the face was detected using Haar-like features as in [5], and the detected image was then
cropped and normalized. The accuracy improved to 99% with the ALDP method without preprocessing [8]. The study
in [9] proposed grey wolf optimization (GWO) for feature selection and a GWO-neural network (GWO-NN) for
feature classification. The face parts (eyes, nose, mouth and ears) are detected using the Viola-Jones
algorithm, and SIFT feature extraction is then used to obtain feature points. The reported accuracy of 89.79%
on CK+ is lower than that of [8]; an accuracy of 91.22% is also reported [9]. The work in [10] proposed a
framework with high-dimensional features that combine appearance and geometric features. The system used deep
sparse autoencoders (DSAE) to learn robust discriminative features and an active appearance model (AAM) to
locate 51 facial landmark points. Three feature descriptors, HoG, gray value and LBP, are used to describe the
local features. The linear dimension reduction method PCA is used to compress the features, and the result is
given as the input to the DSAE. The proposed framework achieved an accuracy of 95.79% on the CK+ dataset using
leave-one-subject-out cross-validation [10].
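
Leave-one-subject-out cross-validation, the protocol reported for [10], holds out all images of one subject per fold so that no identity appears in both the training and the test set. The following is a minimal sketch of that splitting rule only, assuming each image carries a subject identifier; the function name and the example identifiers are hypothetical and are not taken from the cited work.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject.

    subject_ids[i] is the (assumed) identity label of image i.
    """
    for held_out in sorted(set(subject_ids)):
        # Every image of the held-out subject goes to the test fold.
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        # All remaining images form the training fold.
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        yield held_out, train, test


# Hypothetical example: five images from three subjects.
folds = list(leave_one_subject_out(["a", "a", "b", "c", "b"]))
```

The reported accuracy under this protocol is normally the average over all folds, which is why it tends to be a stricter estimate than a random image-level split.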

The work in [11] presented three models: a differential geometric fusion network (DGFN) that extracts
handcrafted features, a deep facial sequential network (DFSN) based on CNN with auto-extracted features, and
DFSN-1, which combines the advantages of DGFN and DFSN by mapping and concatenating handcrafted and
auto-extracted features. DFSN-1 achieved the best performance among the three models on all of the CK+,
Oulu-CASIA and MMI datasets [11]. The study in [12] used a deep convolutional neural network (DCNN) with the
Caffe framework on a Tesla K20Xm GPU. In preprocessing, the frontal face is detected and cropped with OpenCV
from the CK+ and JAFFE facial images. The experiments achieved an accuracy of 97% with leave-one-subject-out
cross validation on CK+ and 98.12% with 10-fold cross validation on JAFFE [12]. Another study analyzed 22 local
binary pattern variants on the JAFFE and CK databases using the simple parameter-free nearest neighbor
classifier (1-NN). On the JAFFE database, the highest recognition accuracy of 97.14% was achieved by dLBPa,
ELGS and LTP, while on the CK database the highest recognition rate of 100% was achieved by the AELTP, BGC3,
CSALTP, dLBPa, nLBPd, STS, and WLD descriptors. The basic LBP descriptor achieved acceptable performance of
95.71% on JAFFE and 99.28% on the CK database. The study can be extended to include other problems and other
datasets.
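
The parameter-free 1-NN classifier mentioned above assigns a query the label of its closest training sample. A minimal sketch follows, assuming feature vectors are plain sequences of numbers compared by Euclidean distance; the distance choice, the function name and the toy vectors are assumptions for illustration, since the surveyed study's exact setup is not given here.

```python
import math


def nearest_neighbor_predict(train_features, train_labels, query):
    """Return the label of the training vector closest to `query`."""
    if not train_features:
        raise ValueError("training set is empty")
    # Index of the training sample with the smallest Euclidean distance.
    best = min(
        range(len(train_features)),
        key=lambda i: math.dist(train_features[i], query),
    )
    return train_labels[best]


# Hypothetical two-dimensional descriptors with expression labels.
features = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
labels = ["neutral", "happy", "surprise"]
predicted = nearest_neighbor_predict(features, labels, (4.0, 4.5))
```

Because the classifier has no tunable parameters, differences in accuracy between descriptors can be attributed to the descriptors themselves.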
The study in [13] used a DCNN with data augmentation, cross entropy and L2 multi-class SVM [13]. In [14],
weighted center regression adaptive feature mapping (W-CR-AFM) is used for feature distribution and a CNN for
feature training on CK+, the Radboud Faces Database (RaFD), the Amsterdam dynamic facial expression set (ADFES)
and a proprietary database. Unlike other papers, spatial normalization and feature enhancement preprocessing
methods are used. The recognition accuracy was 89.84%, 96.27% and 92.70% for CK+, RaFD and ADFES respectively.
The study in [15] addressed the illumination problem of real-world facial images using the fast Fourier
transform and contrast limited adaptive histogram equalization (FFT+CLAHE) for poor illumination, and then
applied the merged binary pattern code (MBPC). PCA is used for feature dimension reduction and k-NN as the
classifier on the SFEW dataset [15]. The authors of [16] released a new database, iCV-MEFED, at the FG workshop.
A multi-modality CNN is compared with a CNN for micro emotion recognition in that paper. The proposed network
first extracts visual and geometrical feature information and then concatenates these into one long vector. The
feature vector is fed to the hinge loss layer. The framework performs better than the CNN, with a
misclassification of 80.212137, using Caffe [16]. Three further works from the workshop are also reported. The
winning method uses a CNN with a geometric representation of landmark displacement, which leads to better
results than texture-only information. The recognition accuracy reaches 51.84% for seven expressions and 13.7%
for compound emotions, with an average time of 1.57 ms using a GPU or 30 ms using a CPU [17].

The study in [18] employed a deep emotional attention model using a cross-channel CNN with an added attention
modulator on the bimodal face and body (FABO) benchmark database. The system applied a CNN to learn the location
of facial expressions in a cluttered scene. The study reported experiments with a one-expression attention
mechanism and a two-expression attention mechanism. The accuracy of the framework with attention is better than
that without attention [18]. The work in [19] proposed a robust facial landmark extraction method that combines
a data-driven fully convolutional network (FCN) and a model-driven pre-trained point distribution model (PDM)
in three steps: estimation, correction and tuning (ECT). The response maps for global landmark estimation are
computed by the trained FCN, and the maximum points of the maps are then fitted with the PDM to generate the
initial facial shape. Finally, a weighted version of regularized landmark mean-shift (RLMS) is applied to
fine-tune the facial shape iteratively [19].

The work in [20] designed an NN architecture trained with three loss functions: fully supervised, weakly
supervised and hybrid regularization. In the experiments, the proposed model achieved promising results on CK+
and JAFFE under lab conditions and on SFEW in the wild [20]. The study in [21] proposed a transductive deep
transfer learning (TDTL) architecture to address the problem of cross-database non-frontal facial expression
recognition, applying VGGface 16-Net on the BU-3DFE and Multi-PIE datasets. The study found that the feature
representation of the VGG network is better than traditional handcrafted features such as SIFT and LBP at
representing complicated features [21]. The study in [22] also used these two datasets to address the problem of
cross-domain and cross-view facial expressions, using a transductive transfer regularized least-square
regression (TTRLSR) model, color SIFT (CSIFT) features with 49 landmarks, and SVM classifiers. The two databases
have only four categories in common: neutral, surprise, happy and disgust. The experiments covered two
settings: cross-domain with the same view, and cross-view with the same domain. The PCA algorithm was also
applied to reduce the feature dimension.

The studies in references [3, 5-7] classified the six universal emotions: happiness, anger, sadness,
surprise, fear, and disgust. The studies in [9, 13, 15, 23-24] classified one additional class, neutral, and
those in [8, 17, 23] classified a contempt class. All eight classes have been classified by the studies in
[10, 11, 16]. However, [21] and [22] worked on the neutral, happiness, surprise and disgust expressions. Chen et
al. [4] worked with the 5 classes of the GEMEP-FERA 2011 database and the 7 classes of CK+ and AFEW. Li et al.
[25] explained seven basic emotions and 11 compound emotions: sadly angry, sadly surprised, sadly fearful,
happily surprised, happily disgusted, sadly disgusted, fearfully surprised, fearfully angry, angrily surprised,
angrily disgusted and disgustedly surprised. Ferreira et al. [20] classified the 6 universal classes of JAFFE,
the 6 basic classes plus neutral of SFEW, and the 8 classes of CK+ including contempt.


3. TYPICAL FER SYSTEM 

A typical FER system is shown in the system flow of Figure 1. Face detection consists of three tasks:
locating the face, cropping the face, and scaling the face. The feature extraction, dimension reduction and
classification methods can then be selected.
Figure 1. Typical FER system 
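
The crop and scale tasks of the face detection stage can be sketched as follows. This is only an illustration under stated assumptions: the image is a grayscale nested list, the bounding box (top, left, height, width) is assumed to come from some external face detector that is not shown, and nearest-neighbor sampling stands in for whatever interpolation a real system would use. All function names are hypothetical.

```python
def crop(image, box):
    """Cut the face region out of a grayscale image given as rows of pixels."""
    top, left, height, width = box
    return [row[left:left + width] for row in image[top:top + height]]


def scale_nearest(image, out_h, out_w):
    """Resize to out_h x out_w with nearest-neighbor sampling."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def preprocess_face(image, box, size):
    """Crop the detected face and scale it to a fixed size x size input."""
    return scale_nearest(crop(image, box), size, size)


# Hypothetical 4 x 4 image whose pixel value encodes its position.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
face = preprocess_face(image, box=(1, 1, 2, 2), size=4)
```

Scaling every face to the same size gives the later feature extraction and classification stages a fixed input dimension.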


4. FEATURES OF FACIAL IMAGES 
Most FER systems use geometrical features, visual (appearance) features, or both to extract features from
face images.


4.1. Geometrical features 

Geometrical methods estimate the locations of facial landmarks or of facial components such as the eyebrows,
the mouth, and the nose. These features can be measured by distances, curvatures, deformations, and other
geometric properties to represent the geometric facial features; however, such features are sensitive to
noise [3-4, 9, 16-17]. The paper [9] described a facial point extraction method that extracts the points of the
eyes, nose, mouth, and ears based on the Viola-Jones object detection algorithm. Four key regions of the face
are used to extract geometric features in four steps: detect the face, detect the eyes, locate the eye centers
and obtain the eye region height, and estimate the nose and lip regions. In the paper [17], a facial landmark
displacement method is applied to extract geometrical information. In [4], affective geometric features are
extracted using the warp transformation of facial landmarks to capture the configuration of the landmarks. A
facial landmark set of 68 points is described as the geometrical representation of the face in [16].
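
As an illustration of distance-based geometric features, the sketch below computes all pairwise landmark distances and normalizes them by the distance between two reference landmarks (for example the eye centers) so that the features do not depend on face size. The normalization choice, the function name and the toy coordinates are assumptions made for this example and are not taken from any of the cited papers.

```python
import math
from itertools import combinations


def geometric_features(landmarks, ref_a, ref_b):
    """Pairwise landmark distances divided by a reference distance.

    landmarks: list of (x, y) points; ref_a and ref_b: indices of the two
    reference landmarks (assumed here to be the eye centers).
    """
    scale = math.dist(landmarks[ref_a], landmarks[ref_b])
    if scale == 0:
        raise ValueError("reference landmarks coincide")
    return [math.dist(p, q) / scale for p, q in combinations(landmarks, 2)]


# Hypothetical three-point face: two eye centers and one mouth point.
points = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
features = geometric_features(points, ref_a=0, ref_b=1)
```

With 68 landmarks this produces 68 * 67 / 2 = 2278 distances, which is one reason dimension reduction is commonly applied afterwards.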


4.2. Appearance features 

Appearance methods such as scale invariant feature transform (SIFT), Gabor appearance and local phase
quantization detect multi-scale, multi-direction changes of local texture, either on specific regions or on
the whole face, to encode the texture [3-4, 8-9, 16]. In [7], a mapped local binary pattern with four
neighborhoods is used to describe the change of local texture features, and the face is then divided into six
regions (forehead, eyes, nose, mouth, left cheek and right cheek) using a pseudo 3D model. The paper [8]
described the texture feature using the angled local directional pattern, which takes the center pixel into
account. In reference [9], the scale invariant feature transform is applied to extract unique and precise
informative face features. The paper [3] used the local binary pattern to extract local texture features from
four basic regions of the face: the two eyes, the nose and the mouth. To extract dynamic texture features from
video, [4] used histograms of oriented gradients from three orthogonal planes (HOG-TOP). In [16], the visual
features are extracted from the color image using a convolutional neural network (CNN) as the feature
descriptor. These approaches are time-consuming and the feature dimension is huge, so dimensionality reduction
methods are used, and these in turn affect the accuracy of facial expression recognition.
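
The basic local binary pattern underlying several of the descriptors above can be sketched as follows. This is the plain 8-neighbor, radius-1 LBP with a 256-bin histogram, written as an illustrative assumption; it is not the mapped, regional or angled variant used in [3], [7] or [8], and the neighbor ordering and function names are choices made for this example.

```python
# Clockwise neighbor offsets starting at the top-left pixel (an assumed order).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(image, r, c):
    """8-bit LBP code of pixel (r, c): bit k is set if neighbor k >= center."""
    center = image[r][c]
    code = 0
    for bit, (dr, dc) in enumerate(OFFSETS):
        if image[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            hist[lbp_code(image, r, c)] += 1
    return hist


# Hypothetical 3 x 3 patch: only the top-left neighbor is >= the center.
patch = [[9, 0, 0], [0, 5, 0], [0, 0, 0]]
code = lbp_code(patch, 1, 1)
```

Region-based methods typically compute one such histogram per face region and concatenate them, which multiplies the feature length by the number of regions.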


5. FACIAL DATASETS 

Facial expression datasets are created in two ways: as posed expression image datasets and as spontaneous
expression image datasets. Researchers have acquired facial images in three forms: peak expression images
only, image sequences portraying an emotion from neutral to its peak, and video clips with emotional
annotations. The two most widely used datasets are CK+ and JAFFE [26-29]. The real-world facial databases are
FER-2013, FERG-DB, SFEW2.0 (static facial expression in the wild), RAF-DB (real world affective face database)
and the AffectNet database. Sample images of basic facial expressions are shown in Table 1 for each dataset.


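Table 1. Sample images of facial image datasets 

[Table of sample images not reproduced. Rows (datasets): CK+, JAFFE, FER-2013, FERG-DB, SFEW, RAF-DB,
AffectNet. Columns: sample images per expression; only the "Surprise" and "Disgust" column labels are
recoverable.]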


5.1. Extended Cohn-Kanade dataset (CK+) 

The CK+ dataset has been widely used for many years in facial expression systems. It comprises 593 image
sequences, varying in duration from 10 to 60 frames, collected from 123 subjects. The age range of the
subjects is 18-50 years; 31% are men and 69% are women. The images express seven categories of expressions:
happy, sad, surprise, anger, fear, disgust, and neutral, which cover the basic emotions. Each image has a
resolution of 640 * 490 or 640 * 480 pixels [27].


5.2. Japanese female facial expression dataset (JAFFE) 

The JAFFE dataset is also widely used in the recognition of human emotional expressions. It consists of 213
images of 10 Japanese females covering seven expressions: six basic (happy, surprise, sad, anger, fear and
disgust) and neutral. Each image has a resolution of 256 * 256 pixels [28].


5.3. FER 2013 dataset 

The FER-2013 dataset contains 28,000 labeled images. The dataset was created in 2013 for learning focused on
three challenges: the black box learning challenge, the facial expression recognition challenge and the
multimodal learning challenge. The images are 48 * 48 pixel grayscale images of faces in seven expressions:
six basic expressions and neutral [30].
5.4. FERG-DB dataset 

FERG-DB stands for facial expression research group database. It consists of face images of six stylized
characters grouped into seven types of expressions: six basic expressions and neutral. The dataset includes
55,767 images [31].


5.5. Static facial expression in the wild dataset (SFEW) 

The images in SFEW are extracted from the temporal facial expression database Acted Facial Expressions in
the Wild (AFEW), which has itself been extracted from movies. The database contains 700 images that have been
labeled with six basic expressions [16].


5.6. Real-world affective face database (RAF-DB) 
RAF-DB is a large-scale facial expression database that includes facial images downloaded from the internet.
Each image in the dataset is annotated with a seven-dimensional expression distribution vector [10].


5.7. AffectNet dataset 

AffectNet is the largest database of real-world facial expressions and contains more than 1,000,000 facial
images retrieved from internet searches in six different languages using 1250 emotion-related keywords. The
database defines eleven categories of expression: six basic expressions, neutral, contempt, none, uncertain,
and non-face [16].


6. PROBLEM STATEMENT 
FER systems need to be developed to cope with the problems of illumination, lighting, pose, aging and
occlusion in real-world expression classification. The major challenges include:
- Most research classifies basic emotions, while work on fine-grained emotions is relatively scarce.
- Research on micro-expression and compound emotion recognition systems is limited.
- Mathematical models need to be developed to extract more discriminative features from facial images in
the wild.
- Real-time facial expression recognition systems should be developed to meet practical applications.
- Deep learning models also need to be created to improve facial feature extraction and classification.


7. CONCLUSION AND FUTURE WORK 

Facial expression recognition is an active research area that remains interesting to researchers because of
the problems of occlusion, brightness, viewing angle, pose, and background in real-life images, image sequences
and videos. This review paper has presented methods of preprocessing, feature extraction and classification.
FER research continues toward real-life applications such as driver drowsiness recognition, distance learning
assistance, clinical patient monitoring, teaching robots, and health care systems for children with autism. In
the future, FER systems will be developed for fine-grained facial expression recognition and compound emotion
recognition using facial images.